Method and system for managing computer systems

ABSTRACT

A management system for a computer system is disclosed. The computer system operates or includes various products (e.g., software products) that can be managed in a management system or collectively by a group of management systems. Typically, the management system operates on a computer separate from the computer system being managed. The management system can make use of a knowledge base of causing symptoms for previously observed problems at other sites or computer systems. In other words, the knowledge base can be built from and shared by different users across different products to leverage knowledge that is otherwise disparate. The knowledge base typically grows over time. The management system can use its ability to request information from the computer system being managed together with the knowledge base to infer a problem root cause in the computer system being managed. The computer system being managed can also request the management system to process its knowledge base for possible problem cause analysis. The management system can also continually identify persisting problem causing symptoms.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No. 13/662,680, filed Oct. 29, 2012, and entitled “METHOD AND SYSTEM FOR MANAGING COMPUTER SYSTEMS,” (now U.S. Pat. No. 9,020,877), which is hereby incorporated by reference, which is a divisional application of U.S. patent application Ser. No. 12/661,244, filed Mar. 12, 2010, and entitled “METHOD AND SYSTEM FOR MANAGING COMPUTER SYSTEMS,” (now U.S. Pat. No. 8,301,580), which is hereby incorporated by reference, which is a divisional application of U.S. patent application Ser. No. 11/585,660, filed Oct. 23, 2006, and entitled “METHOD AND SYSTEM FOR MANAGING COMPUTER SYSTEMS,” (now U.S. Pat. No. 7,707,133), which is hereby incorporated by reference herein, which is a continuation of U.S. patent application Ser. No. 10/412,639, filed Apr. 10, 2003, and entitled “METHOD AND SYSTEM FOR MANAGING COMPUTER SYSTEMS,” which is hereby incorporated by reference herein, and which in turn claims the priority benefit of: (i) U.S. Provisional Patent Application No. 60/371,659, filed Apr. 10, 2002, and entitled “METHOD AND SYSTEM FOR MANAGING COMPUTER SYSTEMS,” which is hereby incorporated by reference herein; and (ii) U.S. Provisional Patent Application No. 60/431,551, filed Dec. 5, 2002, and entitled “METHOD AND SYSTEM FOR MANAGING COMPUTER SYSTEMS,” which is hereby incorporated by reference herein.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to computer systems and, more particularly, to management of computer systems.

2. Description of the Related Art

Today's computer systems, namely enterprise computer systems, make use of a wide range of products. The products are often applications, such as operating systems, application servers, database servers, JAVA Virtual Machines, etc. These computer systems often suffer from network and system-related problems. Unfortunately, given the complex mixture of products concurrently used by such computer systems, there is great difficulty in identifying and isolating application-related problems. Typically, when a problem occurs on a computer system, it must first be isolated to a particular computer system out of many different computer systems or to the network interconnect among these systems and also to a particular application out of many different applications used by the computer system. However, conventionally speaking, isolating the problem is difficult, time consuming and requires a team of application experts with different domain expertise. These experts are expensive, and the resulting down time of computer systems is very expensive to enterprises.

Although management solutions have been developed, such solutions are dedicated to particular customers and/or specific products. Monitoring systems are able to provide monitoring for events, but offer no meaningful management of non-catastrophic problems and prevention of catastrophic problems. Hence, conventional managing and monitoring solutions are dedicated approaches that are not generally usable across different computer systems using combinations of products.

Thus, there is a need for improved management systems that are able to efficiently manage computer systems over a wide range of products.

SUMMARY OF THE INVENTION

Broadly speaking, the invention relates to a management system for a computer system. The computer system operates or includes various products (e.g., software products) that can be managed in a management system or collectively by a group of management systems. Typically, the management system operates on a computer separate from the computer system being managed. The management system can make use of a knowledge base of causing symptoms for previously observed problems at other sites or computer systems. In other words, the knowledge base can be built from and shared by different users across different products to leverage knowledge that is otherwise disparate. The knowledge base typically grows over time. The management system can use its ability to request information from the computer system being managed together with the knowledge base to infer a problem root cause in the computer system being managed. The computer system being managed can also request the management system to process its knowledge base for possible problem cause analysis. The management system can also continually identify persisting problem causing symptoms.

The invention can be implemented in numerous ways including, as a method, system, apparatus, and computer readable medium. Several embodiments of the invention are discussed below.

As a management system for a computer system, one embodiment of the invention can, for example, include at least: a plurality of agents residing within managed nodes of a plurality of different products used within the computer system, and a manager for said management system. The manager is operable across the different products.

As a method for isolating a root cause of a software problem in an enterprise computer system supporting a plurality of software products, one embodiment of the invention can, for example, include at least: forming a knowledge base from causing symptoms and experienced problems provided by a disparate group of personal contributors; and examining the knowledge base with respect to the software problem to isolate the cause of the software problem to one of the software products.

As a method for managing an enterprise computer system, one embodiment of the invention can, for example, include at least the acts of: receiving a fact pertaining to a condition of one of a plurality of different products that are operating in the enterprise computer system; asserting the fact with respect to an inference engine, the inference engine using rules based on facts; retrieving updated facts from the inference engine from those of the rules that are dependent on the fact that has been asserted; and performing an action in view of the updated facts.

As a computer readable medium including at least computer program code stored therein for isolating a root cause of a problem in an enterprise computer system supporting a plurality of products, one embodiment of the invention can, for example, include at least: computer program code for accessing a knowledge base that is formed from causing symptoms and experienced problems provided by a disparate group of personal contributors; and computer program code for examining the knowledge base with respect to the problem to isolate the cause of the problem to one of the products.

Other aspects and advantages of the invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be readily understood by the following detailed description in conjunction with the accompanying drawings, wherein like reference numerals designate like structural elements, and in which:

FIG. 1 is a block diagram of a management system according to one embodiment of the invention.

FIG. 2 is a block diagram of a manager for a management system according to one embodiment of the invention.

FIG. 3 is a block diagram of a GUI (Graphical User Interface) according to one embodiment of the invention.

FIG. 4 is a block diagram of a knowledge manager according to one embodiment of the invention.

FIG. 5A is a diagram of a directed graph representing a knowledge base.

FIG. 5B represents a small portion of knowledge provided in a segment of a directed graph (e.g., directed graph).

FIG. 5C represents a small portion of knowledge provided in a segment of a directed graph (e.g., directed graph).

FIG. 6 is a block diagram of a knowledge processor according to one embodiment of the invention.

FIG. 7 is a block diagram of a management framework interface according to one embodiment of the invention.

FIG. 8 is a block diagram of a report module according to one embodiment of the invention.

FIG. 9A is a diagram illustrating a knowledge base according to one embodiment of the invention.

FIG. 9B is an architecture diagram for a rule pack according to one embodiment of the invention.

FIG. 10 illustrates a relationship between facts, rules and actions.

FIG. 11 illustrates an object diagram for a representative knowledge representation.

FIG. 12 is a block diagram of the managed node according to one embodiment of the invention.

FIG. 13 is a block diagram of an agent according to one embodiment of the invention.

FIG. 14 is a block diagram of a master agent according to one embodiment of the invention.

FIG. 15 is a block diagram of a sub-agent according to one embodiment of the invention.

FIGS. 16A and 16B are flow diagrams of manager startup processing according to one embodiment of the invention.

FIGS. 16C-16E are flow diagrams of manager startup processing according to another embodiment of the invention.

FIG. 17A is a flow diagram of master agent startup processing according to one embodiment of the invention.

FIG. 17B is a flow diagram of sub-agent startup processing according to one embodiment of the invention.

FIGS. 18A and 18B are flow diagrams of trigger/notification processing according to one embodiment of the invention.

FIG. 19 is a flow diagram of GUI report processing according to one embodiment of the invention.

FIGS. 20-29 are screen shots of a representative Graphical User Interface (GUI) suitable for use with one embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

The invention pertains to a management system for a computer system (e.g., an enterprise computer system). The computer system operates or includes various products (e.g., software products) that can be managed in a management system or collectively by a group of management systems. Typically, the management system operates on a computer separate from the computer system being managed. The management system can make use of a knowledge base of causing symptoms for previously observed problems at other sites or computer systems. In other words, the knowledge base can be built from and shared by different users across different products to leverage knowledge that is otherwise disparate. The knowledge base typically grows over time. The management system can use its ability to request information from the computer system being managed together with the knowledge base to infer a problem root cause in the computer system being managed. The computer system being managed can also request the management system to process its knowledge base for possible problem cause analysis. The management system can also continually identify persisting problem causing symptoms.

Embodiments of the invention are discussed below with reference to FIGS. 1-29. However, those skilled in the art will readily appreciate that the detailed description given herein with respect to these figures is for explanatory purposes as the invention extends beyond these limited embodiments.

FIG. 1 is a block diagram of a management system 100 according to one embodiment of the invention. The management system 100 serves to manage a plurality of managed nodes 102-1, 102-2, . . . , 102-n. Each of the managed nodes 102-1, 102-2, . . . , 102-n respectively includes an agent 104-1, 104-2, . . . , 104-n. These agents 104 serve to monitor and manage products at the managed nodes 102. In one implementation, the agents 104 are stand alone processes operating in their own process space. In another implementation, the agents 104 are specific to particular products being managed and reside at least partially within the process space of the products being managed. The agents 104 can monitor and collect data pertaining to the products. Since the products can utilize an operating system or network coupled to the managed nodes, the agents 104 are also able to collect state information pertaining to the operating system or the network. In still another implementation, the agents 104 are an embodiment of Simple Network Management Protocol (SNMP) agents available from third-parties or system vendors.

The agents 104 can be controlled to monitor specific information (e.g., resources) with respect to user-configurable specifics (e.g., attributes). The information (e.g., resources) being monitored can have zero or more layers or depths of specifics (e.g., attributes). The monitoring of the information can be dynamically on-demand or periodically performed. The information being monitored can be focused or limited to certain details as determined by the user-configurable specifics (e.g., attributes). For example, the information being monitored can be focused or limited by certain levels/depths.

Optionally, the agents 104 can also be capable of performing certain statistical analysis on the data collected at the managed nodes. For example, the statistical analysis on the data might pertain to running average, standard deviation, or historical maximum and minimum.
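
As an illustration only, the statistics mentioned above could be maintained incrementally as samples arrive at an agent. The following minimal sketch assumes a hypothetical `RunningStats` helper (not a name used in this description) and uses Welford's online method for the standard deviation; an actual agent could just as well compute a windowed average.

```python
import math


class RunningStats:
    """Hypothetical helper: incremental statistics over samples collected by an agent."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0            # running average of all samples so far
        self._m2 = 0.0             # sum of squared deviations from the mean (Welford)
        self.minimum = math.inf    # historical minimum
        self.maximum = -math.inf   # historical maximum

    def add(self, value):
        """Fold one new sample into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def std_dev(self):
        """Population standard deviation of the samples seen so far."""
        return math.sqrt(self._m2 / self.count) if self.count else 0.0
```

Because each sample is folded in as it arrives, such a helper would not need to retain the raw samples at the managed node.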

The management system 100 also includes a management framework 106. The management framework 106 facilitates communications between the agents 104 for the managed nodes 102 and the manager 108. For example, different agents 104 can utilize different protocols (namely, management protocols) to exchange information with the management framework 106.

The management system 100 also includes a manager 108. The manager 108 serves to manage the management system 100. Consequently, the manager 108 can provide cross-products, cross-systems and multi-systems management in a centralized manner, such as for an enterprise network environment having multiple products or applications which serve different types of requests. In an enterprise network environment, the manager 108 has the ability to manage the various systems therein and their products and/or applications through a single entity. Geographically, these systems and products and/or applications can be centrally located or distributed locally or remotely (even globally).

FIG. 2 is a block diagram of a manager 200 for a management system according to one embodiment of the invention. For example, the manager 200 illustrated in FIG. 2 can pertain to the manager 108 illustrated in FIG. 1.

The manager 200 includes a Graphical User Interface (GUI) 202 that allows a user (e.g., an administrator) to interact with the manager 200 to provide user input. The user input can pertain to rules, resources or situations. In addition, the user input with the GUI 202 can pertain to administrative or configuration functions for the manager 200 or output information (e.g., reports, notifications, etc.) from the manager 200. The input data is supplied from the GUI 202 to a knowledge manager 204. The knowledge manager 204 confirms the validity of the rules, resources or situations and then converts such rules, resources or situations into a format being utilized for storage in a knowledge base 206. In one implementation, the format pertains to meta-data represented as JAVA properties. The knowledge base 206 stores the rules, resources and situations within the database in a compiled code format.

The manager 200 also includes a knowledge processor 208. The knowledge processor 208 interacts with the knowledge manager 204 to process appropriate rules within the knowledge base 206 in view of any relevant situations or resources. In processing the rules, the knowledge processor 208 often requests data from the agents 104 at the managed nodes. Such requests for data are initiated by the knowledge processor 208 and performed by way of a data acquisition unit 210 and a management framework interface 212. The returned data from the agents 104 is returned to the knowledge processor 208 via the data acquisition unit 210 and the management framework interface 212. With such monitored data in hand, the knowledge processor 208 can evaluate the relevant rules. When the rules (evaluated by the knowledge processor 208 in accordance with the monitored data received from the agents 104) indicate that a problem exists, then a variety of different actions can be performed. A corrective action module 213 can be initiated to take corrective action with respect to resources at the particular one or more managed nodes that have been identified as having a problem. Further, if debugging is desired, a debug module 214 can also be activated to interact with the particular managed nodes to capture system data that can be utilized in debugging the particular system problems.

The knowledge processor 208 can periodically, or on a scheduled basis, perform certain of the rules stored within the knowledge base 206. The notification module 216 can also initiate the execution of certain rules when the notification module 216 receives an indication from one of the agents 104 via the management framework interface 212. Typically, the agents 104 would communicate with the notification module 216 using a notification that would specify a management condition that the agent 104 has sent to the manager 200 via the management framework 106.

In addition, the manager 200 also includes a report module 218 that can take the data acquired from the agents 104 as well as the results of the processed rules (including debug data as appropriate) and generate a report for use by the user or administrator. Typically, the report module 218 and its generated reports can be accessed by the user or administrator through the GUI 202. The manager 200 also includes a log module 220 that can be used to store a log of system conditions. The log of system conditions can be used by the report module 218 to generate reports.

The manager 200 can also include a security module 222, a registry 224 and a registry data store 226. The security module 222 performs user authentication and authorization. Also, to the extent encoding is used, the security module 222 also performs encoding or decoding (e.g., encryption or decryption) of information. The registry 224 and the registry data store 226 respectively serve and store structured information. In one implementation, the registry data store 226 serves as the physical storage of certain resource information, configuration information and compiled knowledge information from the knowledge base 206.

Still further, the manager 200 can include a notification system 228. The notification system 228 can use any of a variety of different notification techniques to notify the user or administrator that certain system conditions exist. For example, the communication techniques can include electronic mail, a pager message, a voice message or a facsimile. Once notified, the notified user or administrator can gain access to a report generated by the report module 218.

The debug module 214 is able to be advantageously initiated when certain conditions exist within the system. Such debugging can be referred to as “just-in-time” debugging. This focuses the capture of data for debug purposes to a constrained time period in specific areas of interest such that more relevant data is able to be captured.

FIG. 3 is a block diagram of a GUI 300 according to one embodiment of the invention. The GUI 300 is, for example, suitable for use as the GUI 202 illustrated in FIG. 2.

The GUI 300 includes a knowledge input GUI 302, a report output GUI 304, and an administrator GUI 306. The knowledge input GUI 302 provides a graphical user interface that facilitates interaction between a user (e.g., administrator) and a manager (e.g., the manager 200). Hence, using the knowledge input GUI 302, the user or administrator can enter rules, resources or situations to be utilized by the manager. The report output GUI 304 is a graphical user interface that allows the user to access reports that have been generated by a report module (e.g., the report module 218). Typically, the report output GUI 304 would not only allow initial access to such reports, but would also provide a means for the user to acquire additional detailed information about reported conditions. For example, the report output GUI 304 could enable a user to view a report on chosen criteria such as case ID or a period of time. The administrator GUI 306 can allow the user to configure or utilize the manager. For example, the administrator GUI 306 can allow creation of new or modification to existing users and their access passwords, specific information about managed nodes and agents (including managed-node IP and port, agent name, agent types), electronic mail server and user configuration.

FIG. 4 is a block diagram of a knowledge manager 400 according to one embodiment of the invention. The knowledge manager 400 is, for example, suitable for use as the knowledge manager 204 illustrated in FIG. 2.

The knowledge manager 400 includes a knowledge code generator 402. In particular, the knowledge code generator 402 receives rules or definitions (namely, definitions for resources or situations) and then generates and outputs knowledge code to a knowledge processor, such as the knowledge processor 208. In one implementation, the knowledge code generator 402 can be considered a compiler, in that the rules or definitions are converted into a data representation suitable for execution. The knowledge code can be a program code or it can be a meta-language. In one implementation, the knowledge code is executable by an inference engine such as JESS. Additional information on JESS is available at “herzberg.ca.sandia.gov/jess” as an example.

The knowledge manager 400 also includes a knowledge encoder/decoder 404, a knowledge importer/exporter 406 and a knowledge update manager 408. The knowledge encoder/decoder 404 can perform encoding when storing knowledge to the knowledge base 206 or decoding when retrieving knowledge from the knowledge base 206. The knowledge importer/exporter 406 can import knowledge from another knowledge base and can export knowledge to another knowledge base. In general, the knowledge update manager 408 serves to manually or automatically update the knowledge base 206 with additional sources of knowledge that are available and suitable. In one embodiment, the knowledge update manager 408 operates to manage the general coherency of the knowledge base 206 with respect to a central knowledge base. Typically, the knowledge base 206 stored and utilized by the knowledge manager 400 is only a relevant portion of the central knowledge base for the environment in which the knowledge manager 400 operates.

FIG. 5A is a diagram of a directed graph 500 representing a knowledge base. The knowledge base represented by the directed graph 500 is, for example, suitable for use as the knowledge base 206 illustrated in FIG. 2. The directed graph 500 represents a pictorial view of the knowledge code resulting from rules, situations and resources.

The directed graph 500 is typically structured to include base resources at the top of the directed graph 500, situations/resources in a middle region of the directed graph 500, and actions (action resources) at the bottom (or leaf nodes) of the directed graph 500. In particular, node 502 pertains to a base resource or resources and node 504 pertains to situation and/or resource. A relationship 506 between the nodes 502 and 504 is determined by the rule being represented by the directional arrow between the nodes 502 and 504. The situation/resource at node 504 in turn relates to another situation/resource at node 508. A relationship 510 relates the nodes 504 and 508, namely, the relationship 510 is determined by the rule represented by the directional arrow between the nodes 504 and 508. The situations/resources at nodes 504 and 508 together with the relationship 510 pertain to another rule. The situation/resource at node 508 is further related to an action resource at node 512. A relationship 514 between the situation/resource at node 508 and the action resource at node 512 is determined by still another rule, namely, an action rule.
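
For illustration, the structure described for FIG. 5A can be sketched as a small adjacency-list graph in which each rule is a labeled, directed edge. This is only a sketch under stated assumptions: the `KnowledgeGraph` class, its method names, and the node "kind" strings are hypothetical and do not reflect the stored knowledge code format; only the reference numerals come from the description.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    """Hypothetical in-memory form of the directed graph 500."""
    kinds: dict = field(default_factory=dict)   # node numeral -> kind of node
    edges: dict = field(default_factory=dict)   # source numeral -> [(relationship numeral, target numeral)]

    def add_node(self, ref, kind):
        self.kinds[ref] = kind

    def add_rule(self, rel_ref, source, target):
        """Store a rule as a labeled, directed edge between two nodes."""
        self.edges.setdefault(source, []).append((rel_ref, target))

    def actions_reachable_from(self, start):
        """Follow the directed edges from `start` and collect the action nodes reached."""
        found, stack, seen = [], [start], set()
        while stack:
            ref = stack.pop()
            if ref in seen:
                continue
            seen.add(ref)
            if self.kinds[ref] == "action":
                found.append(ref)
            stack.extend(target for _, target in self.edges.get(ref, []))
        return found


def build_fig_5a():
    """Rebuild the four nodes and three relationships described for FIG. 5A."""
    graph = KnowledgeGraph()
    graph.add_node(502, "base_resource")
    graph.add_node(504, "situation_or_resource")
    graph.add_node(508, "situation_or_resource")
    graph.add_node(512, "action")
    graph.add_rule(506, 502, 504)
    graph.add_rule(510, 504, 508)
    graph.add_rule(514, 508, 512)   # the action rule
    return graph
```

In this representation, adding knowledge amounts to adding nodes and edges, which leaves the existing edges untouched.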

The knowledge base represented by the directed graph 500 is flexible and extendible given the hierarchical architecture of the directed graph 500. Hence, the knowledge base is able to grow over time to add capabilities without negatively affecting previously existing knowledge within the knowledge base. The knowledge base is also able to be divided or partitioned for different users, applications or service plans. In effect, as the knowledge base grows, the directed graph 500 representation grows to add more nodes, such nodes representing situations or resources as well as relationships (i.e., rules) between nodes.

FIG. 5B represents a small portion of knowledge provided in a segment 520 of a directed graph (e.g., directed graph 500). The segment 520 includes nodes 522, 526, 530 and 534, and relationships 524, 528 and 532. The node 522 pertains to a resource, namely, heap size of Java Virtual Machine (JVM) in use. The relationship 524 indicates that when the node 522 is triggered, the node 526 is triggered. The node 526 pertains to a resource, namely, maximum heap size of JVM. The relationship 528 evaluates whether the maximum heap size for JVM is less than 1/0.8 times the heap size for JVM in use (i.e., whether the heap in use exceeds 80% of the maximum). When the relationship 528 is true, then the node 530 is triggered to acquire a resource, namely, TopHeapObjects for JVM, which is a debugging resource that obtains the information about the objects that are consuming the most JVM heap. The specifics of this resource, which are described by the attributes of the resource, include the resource consumption selected by cumulative size or the number of objects, the count of the distinct objects, and the selection of objects by the JAVA classes they belong to. The relationship 532 then always causes the node 534 to invoke a resource action, namely, initiating an allocation trace for JVM. The specifics of this resource selectable by its attributes can include, but are not limited to, the classes of objects to trace, the time-period for tracing, and the depth of stack to which to limit every trace.
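
The evaluation order in segment 520 can be sketched as follows. This is a minimal illustration, assuming that the threshold in relationship 528 means the heap in use exceeds 80% of the maximum heap size; the function name `evaluate_heap_segment` and the `ratio` parameter are hypothetical.

```python
def evaluate_heap_segment(heap_in_use, max_heap, ratio=0.8):
    """Return the ordered list of node numerals triggered in segment 520.

    Assumption: relationship 528 is read as max_heap < heap_in_use / ratio,
    which is the same as heap_in_use exceeding `ratio` of max_heap.
    """
    triggered = [522, 526]                 # relationship 524: heap in use -> maximum heap
    if max_heap < heap_in_use / ratio:     # relationship 528
        triggered.append(530)              # acquire TopHeapObjects (debugging resource)
        triggered.append(534)              # relationship 532: always start the allocation trace
    return triggered
```

With a maximum heap of 1000 units, a heap in use of 900 units would trigger all four nodes, while a heap in use of 500 units would stop after node 526.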

FIG. 5C represents a small portion of knowledge provided in a segment 540 of a directed graph (e.g., directed graph 500). The segment 540 includes nodes 542, 546, 550, 554 and 558, and relationships 544, 548, 552 and 556. The node 542 pertains to a situation, namely, a JVM exception. The relationship 544 causes the node 546 to invoke a filter operation when the situation at node 542 is present. The filter operation at node 546 is a search expression that searches the JVM exception resource information received from agent 104 for an attribute “ORA-00018” which represents a particular problem with Oracle database, namely, the Oracle database running out of database connections for the managed JAVA application to use. When the search expression is found, the relationship 548 causes the node 550 to trigger. At node 550, a resource for maximum users configured for the Oracle database being used by the managed JAVA application is obtained. Then, the relationship 552 determines whether the maximum users for the Oracle product is less than fifty (50) and, if so, the node 554 invokes an action, namely, an email notification is sent. In addition, the relationship 556 always triggers the node 558 to acquire a resource pertaining to the number of connected users of the relevant Oracle database. The two rules, one rule represented by resources 542, 546, 550, 558 and the relationships 544, 548, 556, and the second rule represented by the resources 550, 554 and the relationship 552, are two distinct rules defined using GUI 202 at different times and possibly by different users, without needing to know about the existence of the second rule while defining the first rule and vice versa. The knowledge base automatically links or chains these rules through the commonality of the resources (e.g., Oracle maximum configured users resource 550 in this example).
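
The chaining of independently defined rules through a shared resource can be sketched with a small forward-chaining loop. This is an illustrative sketch only, not the inference engine described here: the `RuleBase` class, the resource names, and the `build_fig_5c_rules` helper are hypothetical, and the values that agents would return are passed in as plain arguments.

```python
from collections import defaultdict, deque


class RuleBase:
    """Hypothetical rule store that chains rules by the resource they are keyed on."""

    def __init__(self):
        self._by_trigger = defaultdict(list)   # resource name -> [(condition, effect)]

    def add_rule(self, trigger, condition, effect):
        """Register a rule; `effect(value)` returns a list of (resource, value) facts."""
        self._by_trigger[trigger].append((condition, effect))

    def assert_fact(self, resource, value):
        """Assert a fact, fire dependent rules until no new facts appear, return the actions."""
        actions, queue = [], deque([(resource, value)])
        while queue:
            name, val = queue.popleft()
            if name == "action":               # leaf node: record the action to perform
                actions.append(val)
                continue
            for condition, effect in self._by_trigger[name]:
                if condition(val):
                    queue.extend(effect(val))
        return actions


def build_fig_5c_rules(max_users, connected_users):
    """Define the two rules of segment 540 separately; they meet at 'oracle_max_users'."""
    kb = RuleBase()
    # First rule (nodes 542, 546, 550, 558): filter the exception, then acquire resources.
    kb.add_rule("jvm_exception", lambda text: "ORA-00018" in text,
                lambda _: [("oracle_max_users", max_users)])
    kb.add_rule("oracle_max_users", lambda _: True,
                lambda _: [("oracle_connected_users", connected_users)])
    # Second rule (nodes 550, 554), defined independently: notify when the limit is under 50.
    kb.add_rule("oracle_max_users", lambda n: n < 50,
                lambda n: [("action", f"email: Oracle max users is {n}")])
    return kb
```

Neither rule refers to the other; the second rule fires only because the first one produces the `oracle_max_users` fact it is keyed on.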

FIG. 6 is a block diagram of a knowledge processor 600 according to one embodiment of the invention. The knowledge processor 600 is, for example, suitable for use as the knowledge processor 208 for the manager 200 illustrated in FIG. 2.

The knowledge processor 600 includes a controller 602 that couples to a knowledge manager (e.g., the knowledge manager 204). The controller 602 receives the knowledge code from the knowledge manager and directs it to an inference engine 604 to process the knowledge code. In one embodiment, the knowledge code is provided in an inference language such that the inference engine 604 is able to execute the knowledge code.

In executing the knowledge code, the inference engine 604 will typically inform the controller 602 of the particular data to be retrieved from the managed nodes via the agents and the management framework interface. In this regard, the controller 602 will request the data via a management interface 606 to a management framework. The returned data from the managed nodes is then returned to the controller 602 via the management interface 606. Alternatively, in executing the knowledge code, exceptions (i.e., unexpected events) can be generated at the managed nodes and pushed through the management interface 606 to the controller 602. In either case, the controller 602 then forwards the returned data to the inference engine 604. At this point, the inference engine 604 can continue to process the knowledge code (e.g., rules). The inference engine 604 may utilize a rule evaluator 608 to assist with evaluating the relationships or rules defined by the knowledge code. The rule evaluator 608 can perform not only the relationship checking for rules but also data parsing. Once the knowledge code has been executed, the inference engine 604 can inform the controller 602 to have various operations performed. These operations can include capturing of additional data from the managed nodes, initiating debug operations, initiating corrective actions, initiating logging of information, or sending of notifications.

The knowledge processor 600 also can include a scheduler 610. The scheduler 610 can be utilized by the inference engine 604 or the controller 602 to schedule a future action, such as the retrieval of data from the managed nodes.

FIG. 7 is a block diagram of a management framework interface 700 according to one embodiment of the invention. The management framework interface 700 is, for example, suitable for use as the management framework interface 212 illustrated in FIG. 2.

The management framework interface 700 includes a SNMP adapter 702 and a standard management framework adapter 704. The SNMP adapter 702 allows the management framework interface 700 to communicate using the SNMP protocol. The standard management framework adapter 704 allows the management framework interface 700 to communicate with any other communication protocols that might be utilized by standard management frameworks, such as other product managers and the like. The management framework interface 700 also includes an enterprise manager 706, a domain group manager 708, and an available domain/resources module 710. During startup of the management framework interface 700 (which is typically associated with an enterprise), the enterprise manager 706 will identify all groups within the enterprise. Then, the domain group manager 708 will operate to identify all management nodes within each of the groups. Thereafter, the available domain/resources module 710 will identify all domains and resources associated with each of the identified domains. Hence, the domains and resources for a given enterprise are able to be identified at startup so that the other components of a manager (e.g., the manager 200) are able to make use of the available domains and resources within the enterprise. For example, a GUI can have knowledge of such resources and domains for improved user interaction with the manager, and the knowledge processor can understand which rules within the knowledge base 206 are pertinent to the enterprise.
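
The startup sequence above (groups, then nodes, then domains and resources) can be sketched as a nested walk, followed by a filter that keeps only the rules whose resources were found. This is a sketch under stated assumptions: the nested-dictionary layout, the function names `discover` and `pertinent_rules`, and the sample group, node and rule names are all hypothetical.

```python
def discover(enterprise):
    """Walk groups -> managed nodes -> domains and collect the available resources per domain."""
    available = {}
    for nodes in enterprise.values():                     # enterprise manager 706: groups
        for domains in nodes.values():                    # domain group manager 708: nodes
            for domain, resources in domains.items():     # module 710: domains and resources
                available.setdefault(domain, set()).update(resources)
    return available


def pertinent_rules(rules, available):
    """Return the names of rules whose required (domain, resource) pairs are all available."""
    return sorted(
        name
        for name, required in rules.items()
        if all(res in available.get(dom, set()) for dom, res in required)
    )


# Hypothetical enterprise layout: group -> managed node -> domain -> resources.
SAMPLE_ENTERPRISE = {
    "web-tier": {
        "node-1": {"jvm": ["heap_size", "max_heap_size"]},
        "node-2": {"jvm": ["heap_size"], "oracle": ["max_users"]},
    },
}

# Hypothetical rules and the resources each one depends on.
SAMPLE_RULES = {
    "heap-pressure": [("jvm", "heap_size"), ("jvm", "max_heap_size")],
    "oracle-limit": [("oracle", "max_users")],
    "disk-full": [("os", "disk_free")],
}
```

Here the `disk-full` rule would be left out because no managed node reports an `os` domain, which mirrors how the knowledge processor could limit itself to rules pertinent to the enterprise.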

The management framework interface 700 also includes an incoming notification manager 712. The incoming notification manager 712 receives notifications from the agents within managed nodes. These notifications can pertain to events that have been monitored by the agents, such as a system crash or the presence of a new resource. More generally, these notifications can pertain to changes to monitored data at the managed nodes by the agents.

The management framework interface 700 also includes a managed node administrator module 714. The managed node administrator module 714 allows a user or administrator to interact with the management framework interface 700 to alter nodes or domains within the enterprise, such as by adding new nodes or domains, updating domains, reloading domains, etc.

Still further, the management framework interface 700 can also include a managed node update module 716. The managed node update module 716 can discover managed nodes and thus permits a manager to recognize and receive status (e.g., active/inactive) of the managed nodes.

FIG. 8 is a block diagram of a report module 800 according to one embodiment of the invention. The report module 800 is, for example, suitable for use as the report module 218 illustrated in FIG. 2.

The report module 800 includes a presentation manager 802, a format converter 804 and a report view selector 806. The presentation manager 802 operates to process the raw report data provided by a log module (e.g., log module 220) in order to present an easily understood, richly formatted report. Such a report might include associated graphical components that a user can interact with using a GUI (e.g., GUI 202). Examples of graphical components for use with such reports are buttons, pull-down lists, etc. The format converter 804 can convert the raw report data into a format suitable for printing and display. The report view selector 806 allows viewing of partial or complete log data/raw report data in different ways as selected using a GUI. These views can, for example, include one or more of the following types of reports: (1) Report Managed nodes wise: show report for the selected managed node/process identifier only; (2) Report time wise: show report for the last xyz hours (time desired by the user), with the user having the option of choosing the managed node he wants to view; (3) Report Rule wise: show report for the selected rule that might be applicable for a number of JVM instances; (4) Report Rule pack wise: show report for all the rules fired under a particular rule pack; (5) Report Last Fired Rules wise: show report for rules fired after the last re-start of the inference engine; (6) Report Rule Fired Frequency wise: show report for rules fired as per selected fired frequency (e.g., useful to get the recurrence pattern of event occurrence); (7) Report Domain wise: show report pertaining to a particular domain, e.g., JVM (if a rule is composed of multiple domains, this report can show the rules including the selected domain); (8) Report Resource wise: show report for all rules including a particular resource under the domain (e.g., jvm_Exception); (9) Report filter wise: show report pertaining to rules having similar filter conditions; (10) Report Day wise: show report for all events that happened in a day; (11) Report Refreshed Values wise: show the next refreshed state of the same report and highlight changed/added records; (12) Report Case ID wise: show the report based on problem case identifier (id); and (13) Customized Structure reports: allow the user to select a combination of the above or provide a report filter of their own.
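As a non-limiting sketch, several of the report views listed above amount to filtering the same log data by different keys. The Python example below assumes a hypothetical `LogRecord` structure and a hypothetical `select_view` function. Neither the field names nor the sample records come from the described embodiment.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    node: str         # managed node that produced the record
    rule: str         # rule that fired
    rule_pack: str    # rule pack containing the rule
    hours_ago: float  # age of the record in hours


def select_view(records, node=None, rule_pack=None, last_hours=None):
    """Return the records that match every filter that is not None."""
    selected = []
    for record in records:
        if node is not None and record.node != node:
            continue
        if rule_pack is not None and record.rule_pack != rule_pack:
            continue
        if last_hours is not None and record.hours_ago > last_hours:
            continue
        selected.append(record)
    return selected


LOG = [
    LogRecord("node-a", "memory-leak-detect", "default", 1.0),
    LogRecord("node-b", "memory-leak-detect", "default", 30.0),
    LogRecord("node-a", "connections-exhausted", "oracle", 5.0),
]

# "Managed nodes wise" view, and a combined "rule pack wise" + "time wise" view.
node_view = select_view(LOG, node="node-a")
recent_default = select_view(LOG, rule_pack="default", last_hours=24)
```

Combining filters in one call corresponds to the customized structure reports of item (13), in which a user selects a combination of the other views.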

FIG. 9A is a diagram illustrating a knowledge base 900 according to one embodiment of the invention. The knowledge base 900 is, for example, suitable for use as the knowledge base 206 illustrated in FIG. 2 or the knowledge base 500 illustrated in FIG. 5A. The architecture for the knowledge base 900 renders the knowledge base 900 well-suited to be managed, deployed and scaled. The knowledge base 900 typically resides within a manager, such as the manager 200 illustrated in FIG. 2. However, the knowledge base 900 can also be distributed between a manager and managed nodes, such that the processing load can be likewise distributed.

The knowledge base 900 includes one or more knowledge domains and one or more rule packs. In particular, the knowledge base 900 illustrated in FIG. 9A includes knowledge domain A 902, knowledge domain B 904 and knowledge domain C 906. Through use of the rule packs, these multiple knowledge domains 902, 904, and 906 can be linked together so as to cooperate with one another concurrently. A particular knowledge domain is a software representation of know-how pertaining to a specific field (or domain). The knowledge domains can be physical domains and/or virtual domains. A physical domain often pertains to a particular managed product. A virtual domain can pertain to a set of resources defined by a user to achieve effective manageability.

The knowledge base 900 also includes rule packs 910 and 912. These rule packs (or knowledge rule packs) are collections of rules (i.e., relationships between different kinds of resources/situations). The purpose of the rule packs is to collect the rules such that management, modification and tracking of knowledge are made easier. By separating knowledge into domains and rule packs, each knowledge component can be individually tested as well as tested together with other knowledge components. In other words, each domain or rule pack is a logically separate piece of knowledge which can be installed and uninstalled as desired.

FIG. 9B is an architecture diagram for a rule pack 914 according to one embodiment of the invention. The rule pack 914 includes rules 916, facts 918 and functions 920. The rule pack 914 depends on a set of facts 918 that it uses for its reasoning, a set of facts that it generates, a set of functions 920 that it calls upon, and a set of rules 916 that act to read and write facts and perform the functions.

When a rule pack is installed, the system must keep track of its rules, functions, inputs and outputs so that a large installed base of rule packs can be managed. Hence, an individual rule pack can be added to or removed from the knowledge base without adversely affecting the entire system.

Further, two rule packs may operate on the same set of shared facts. The two knowledge rule packs may also generate a set of shared facts. These rule packs can facilitate the tracking of how a fact travels through various rule packs, and how a fact may be generated by multiple rule packs. The functions and rules of rule packs can also be more precisely monitored by using the smaller sized rule packs. It is also possible for one rule to exist in two or more rule packs. Hence, when such two or more rule packs that share a rule are merged into a knowledge base, only one copy of the rule need exist within the knowledge base.
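For illustration, the bookkeeping described above (installing and uninstalling rule packs, and keeping a single copy of a rule shared by several packs) can be sketched in Python as follows. The `KnowledgeBase` class, its method names, and the rule names are assumptions made for this example only.

```python
class KnowledgeBase:
    """Tracks which rules each installed rule pack contributes."""

    def __init__(self):
        self.packs = {}  # rule pack name -> set of rule names

    def install(self, pack_name, rules):
        self.packs[pack_name] = set(rules)

    def uninstall(self, pack_name):
        # Removing one pack leaves the other packs untouched.
        self.packs.pop(pack_name, None)

    def rules(self):
        # A rule shared by several packs appears only once in the merged set.
        merged = set()
        for pack_rules in self.packs.values():
            merged |= pack_rules
        return merged


kb = KnowledgeBase()
kb.install("pack-1", ["heap-high", "thread-count-high"])
kb.install("pack-2", ["heap-high", "connections-low"])
merged = kb.rules()      # "heap-high" is held once although two packs share it
kb.uninstall("pack-1")
remaining = kb.rules()   # "heap-high" survives because pack-2 still provides it
```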

An expert system object manages the knowledge base. For example, the expert system object can reset an inference engine, load and unload rule packs or domains, insert or retract runtime facts, etc.

The knowledge representation utilized by the present invention makes use of three major components: facts, rules and actions. Collectively, these components are utilized to perform the tasks of monitoring and managing a computer resource, such as a JVM, an operating system, a network, a database or applications.

FIG. 10 illustrates a relationship 1000 between facts 1002, rules 1004 and actions 1006. According to the relationship 1000, facts 1002 trigger rules 1004. The rules 1004 that are triggered cause the actions 1006. The actions 1006 then may cause additional facts to be added to the repository of the facts 1002. A fact can be considered a record of information. One example of a fact is the number of threads running in a JVM. Another example of a fact is an average load on a CPU. Rules are presented as “if-then” statements. In one embodiment, the left-hand side of the “if-then” statement can have one or more patterns, and the right-hand side of the “if-then” rule can contain a procedural list of one or more actions. The patterns are used as conditions to search for a fact in the repository of the facts 1002, and thus locate a rule that can be used to infer something. The actions are functions that perform a task. As an example, the actions can be considered to be statements that would otherwise be used in the body of a programming language (e.g., JAVA or C programs). As another example, the actions can be used to obtain debug information using a resource.
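The cycle in which facts trigger rules, rules cause actions, and actions add further facts can be sketched as a simple loop. The Python example below is a minimal illustration under stated assumptions and not the described implementation: the `run` function, the rule names, and the fact names are hypothetical, each rule is allowed to fire at most once, and no attempt is made to reproduce the matching strategy of an actual rule engine.

```python
def run(facts, rules):
    """Fire each rule at most once until no further rule matches.

    facts: dict mapping fact name -> value (the fact repository)
    rules: list of (name, condition, action); an action returns new facts
    """
    fired = []
    progress = True
    while progress:
        progress = False
        for name, condition, action in rules:
            if name not in fired and condition(facts):
                facts.update(action(facts))  # actions may add facts
                fired.append(name)
                progress = True
    return fired


# Hypothetical rules: the second one only becomes eligible after the
# first pass has added the "overloaded" fact.
RULES = [
    ("notify", lambda f: f.get("overloaded", False),
     lambda f: {"notified": True}),
    ("high-load", lambda f: f.get("cpu_load", 0.0) > 0.9,
     lambda f: {"overloaded": True}),
]

facts = {"cpu_load": 0.95}
fired = run(facts, RULES)
```

Although "notify" is listed first, it fires second, because its condition depends on a fact that only the "high-load" action adds.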

The rules 1004 can be represented in JAVA Expert System Shell (JESS), with JESS also serving as the rule engine that drives these rules. JESS offers a CLIPS-like language for specifying inference rules, facts and functions. The relationship 1000 thus facilitates the creation of a data-driven knowledge base that is well-suited for monitoring and managing computer resources.

FIG. 11 illustrates an object diagram 1050 for a representative knowledge representation. The object diagram 1050 includes a rule pack 1 inference object 1052 and a rule pack 2 inference object 1054. An inference object for a rule pack encompasses the rules written for the corresponding knowledge domain(s), and a rules engine can then read and execute these rules. A JESS package can be utilized to provide this functionality. Surrounding each of the inference objects 1052 and 1054 are domain facts and domain actions. Although the arrangement of the rule packs shown in FIG. 11 is such that the rule packs pertain to a particular domain, rule packs can also be arranged to pertain to multiple domains.

The relationship between a domain fact and an inference object is always an arrow pointing from the fact to the inference object, thereby denoting that facts are “driving” the rules inside the inference engine. The relationship between the inference object and the actions is that of an arrow pointing from the inference object toward the action, meaning the inference rules “drive” the actions. Between the two inference objects 1052 and 1054 are facts and actions that both inference objects 1052 and 1054 utilize. In effect, these inference objects 1052 and 1054 are cooperative expert systems, namely, expert systems that cooperate in a group by sharing some of their knowledge with one another.

Facts can be used to represent the “state” of an expert system in small chunks. For example, a fact may appear as “MAIN::jvm-jvm_heapused (v “3166032”) (uid “372244480”) (instance “13219”) (host unknown)”. The content of the fact indicates that in the current Java Virtual Machine (JVM) on system “unknown” with instance or process id 13219, the size of heap used is 3166032 bytes. In this example, uid, instance and host are some of the attributes of the resource jvm_heapused belonging to the domain jvm. The attributes of a resource that are not used for comparison with other resources need not be included in the facts for the resource. Facts, as implemented by JESS, exist inside the rules engine. To add an additional fact into the rules engine, the new fact is injected into the inference engine object. The repository of facts can be represented hierarchically. The knowledge base can, for example, be sorted and transmitted as needed as a set of XML documents or provided as shared distributed databases using LDAP or as JAVA Properties files.
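To make the structure of such a fact concrete, the following Python sketch splits the example fact into its module, domain, resource and attributes. It is illustrative only: the `parse_fact` function and the `SLOT` pattern are hypothetical, straight quotation marks are assumed in place of the typographic quotes shown above, and the sketch does not reflect how JESS stores facts internally.

```python
import re

# One "(name value)" slot; the value is either quoted or a bare token.
SLOT = re.compile(r'\((\w+)\s+("[^"]*"|[^\s()]+)\)')


def parse_fact(text):
    """Split 'MODULE::domain-resource (slot value) ...' into its parts."""
    head, _, slots = text.partition(" ")
    module, _, template = head.partition("::")
    domain, _, resource = template.partition("-")
    attributes = {key: value.strip('"') for key, value in SLOT.findall(slots)}
    return {
        "module": module,
        "domain": domain,
        "resource": resource,
        "attributes": attributes,
    }


fact = parse_fact(
    'MAIN::jvm-jvm_heapused (v "3166032") (uid "372244480") '
    '(instance "13219") (host unknown)'
)
heap_bytes = int(fact["attributes"]["v"])
```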

In the case of a cooperative expert system, access to a shared set of facts is needed. The facts can be logically organized into separate domains. In one implementation, a user may choose to organize shared knowledge into separate knowledge rule packs, or alternatively, allow the same fact definition to exist within multiple rule packs. In the latter approach, the system can manage the consistency of the facts using a verification process at the managed resource node (in the form of capability requests) and at the knowledge control module (in the form of definition verification).

The rules are used to map facts into actions. Rules are preferably domain-specific such that separate domains of knowledge are provided as modular and independent rule sets. Hence, the modification of one domain of rules and its internal facts would not affect other domains. These different rule packs interact with each other only through shared facts.

An example of a rule implemented using JESS is as follows:

(defrule default—jvm-memory-leak-detect (jvm—jvm_heapused (v ?r1) (uid ?uid) (instance ?instance) (host ?host)) (test (> ?r1 1000000)) => (...some actions...))

The “default-” prefix denotes the rule pack the rule belongs to. Since it is possible that a memory leak can exist for an application or an application server, utilizing separate name spaces for each rule pack allows separation of these rules into different rule packs. Another advantage of using a separate name space for different rule packs is that JESS rules are serializable, meaning that text rules can be encoded into binary form. The ability to store rules in binary form serves to protect the intellectual property encoded within the rules.

Actions are procedural statements to be executed. The actions may reside on the right-hand side of rules in the form of scripts or can be embedded as methods inside programming objects (e.g., JAVA objects). In the case of scripts, the scripts are inference engine-dependent such that different inference engines would utilize different scripts because of the different languages utilized by the inference engines. In the case of programming objects, the actions are functions. For example, actions in JAVA can be implemented by registering them as new JESS functions. Alternatively, the functions could be packaged inside fact objects for which such rules are relevant. The functions could in turn request relevant resource values from the managed nodes and assert the values obtained as facts into the inference engine. The fact objects (e.g., get values) represent values obtained from agents (e.g., using a scheduler of an agent).

Given that actions can be complicated and not tied to any particular facts, it is often more efficient to create a global object for a domain and include the methods or functions for actions therein such that every rule within a rule pack has access to the actions.

Through the use of a modular design, the system becomes easier to manage even when thousands of rules and facts exist. By separating rules into rule packs and facts into domains, and making it difficult for domains to interfere with one another, the expert system is effectively divided into smaller modular pieces. Additionally, through use of JESS's built-in watch facility, the system can track those rules that have fired and the order in which they have fired. This watch facility thus provides a limited tool for debugging a knowledge system. Groups of rules can be isolated for inspection by turning off other rules. Rules can be turned off by preventing the inference objects that are not desired from firing. If one were to desire to debug a set of rules related to one domain, such a set of rules could be manually grouped into a logical group (e.g., rule pack) and a user of the management system can use the GUI 202 to control the activation of each group. Using the GUI 202, the user can additionally control activation of a single rule or a selected set of rules within a rule pack.

Initialization scripts can be used to set up all the components needed for a rule pack. The setup can operate to create the inference object, load the rules, create initial facts, create action objects, and link all the objects together so that they can inter-operate.

In the JESS/JAVA implementation, one inference object may contain rules from one or more rule packs. Outside the inference object are objects that represent facts and objects that encapsulate actions. Each inference object is attached to a set of facts and actions. The rules engine searches the facts for matches that can trigger a rule to fire. Once a rule is fired, one or more action objects being linked thereto are invoked. Actions can also be explicitly linked by using an initialization that involves JAVA object creation and passing handles to these objects to appropriate JESS inference objects.

One useful aspect of the rules engine design is the ability of the system to manage different combinations of multiple products on multiple nodes using one set of rule packs and one manager. This simplifies the distribution, configuration and manageability of rule packs on a per-user basis. For example, the rules engine can have rule packs for managed products JVM and Oracle loaded, but one managed node may not have Oracle as a managed product. In this case, no facts corresponding to Oracle resources will be asserted into the inference engine for that managed node, and hence the rules using those Oracle resources will not be active for the managed node without Oracle as a managed product. Note that the information about the managed node is part of the fact representing any Oracle resource.

Another useful aspect of the rules engine design is the implicit chaining of rules by the inference engine. A user of the system defines individual rules representing problem or diagnostic “cases”. The system combines these individual rules based on the use of common facts representing resources. For example, one rule can be, represented in a meta-language, “IF (jvm—uncaught_exception AND filter—exception_is_Oracle_connections_exhausted) THEN (get Oracle—max_connections_configured)”. A second rule can be, represented in a meta-language, “IF (Oracle—max_connections_configured<50) THEN (email dba)”. When the inference engine is running, if the jvm—uncaught_exception gets asserted into the inference engine and if the asserted fact contains the Oracle_connections_exhausted status, then the management system will obtain the Oracle—max_connections_configured resource from the same managed node as described by the host attribute of the exception resource. On request from the inference engine, the corresponding fact will be asserted into the inference engine. The inference engine will now automatically detect the second rule definition using the Oracle—max_connections_configured resource and the second rule will automatically get into action. It will check if the fact value representing the Oracle—max_connections_configured resource is less than 50 and, if so, it will automatically send electronic mail to the dba.
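The two-rule example above can be traced with a short Python sketch. It is a simplified illustration rather than the JESS implementation: the `diagnose` function, the dictionary-based fact layout, the host name, and the `NODE_RESOURCES` table (standing in for values an agent would return) are all hypothetical. The point shown is that the second rule is never invoked explicitly. It becomes applicable only because the first rule's action causes a new fact to be asserted for the same host.

```python
# Hypothetical resource values an agent would return, keyed by host.
NODE_RESOURCES = {"host-1": {"max_connections_configured": 40}}


def diagnose(facts, node_resources):
    """Process facts in order; actions may assert further facts."""
    actions = []
    pending = list(facts)
    while pending:
        fact = pending.pop(0)
        if (fact["resource"] == "uncaught_exception"
                and fact.get("status") == "Oracle_connections_exhausted"):
            # Rule 1: fetch the resource from the same managed node
            # and assert it as a new fact.
            host = fact["host"]
            value = node_resources[host]["max_connections_configured"]
            pending.append({
                "resource": "max_connections_configured",
                "v": value,
                "host": host,
            })
        elif fact["resource"] == "max_connections_configured" and fact["v"] < 50:
            # Rule 2: matches the fact asserted by rule 1.
            actions.append(("email", "dba", fact["host"]))
    return actions


exception = {
    "resource": "uncaught_exception",
    "status": "Oracle_connections_exhausted",
    "host": "host-1",
}
actions = diagnose([exception], NODE_RESOURCES)
```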

FIG. 12 is a block diagram of the managed node 1200 according to one embodiment of the invention. The managed node 1200 is, for example, suitable for use as one or more of the managed nodes 102 illustrated in FIG. 1.

The managed node includes a plurality of different managed products 1202. In particular, the managed node 1200 includes managed products 1202-1, 1202-2, . . . , 1202-n. These managed products 1202 are software products that form part of the system being managed by a management system. The managed products can vary widely depending upon implementation. As examples, the managed products can pertain to a Solaris operating system, an Oracle database, or a JAVA application.

The managed node 1200 also includes an agent 1204. The agent 1204 couples to each of the managed products 1202. The agent 1204 also couples to a manager (e.g., the manager 108 illustrated in FIG. 1) via the management framework 106. In general, the agent 1204 can interact with the managed products 1202 such that the managed products 1202 can be monitored and possibly controlled by the management system via the agent 1204.

Additionally, in one embodiment, one or more of the managed products 1202 can include an application agent 1206. For example, as shown in FIG. 12, the managed product N 1202-n includes the application agent 1206. Here, the application agent 1206 resides within the process space of the managed product N 1202-n (and thus out of the process space of the agent 1204). The application agent 1206 can render the managed product N 1202-n more manageable by the agent 1204. For example, the application agent 1206 can enable any JAVA application to be managed. The capabilities of the application agent 1206 can be further enhanced by the user adding application code to the application agent conforming to the Application Programming Interface (API) provided by the application agent 1206. This methodology provides a convenient means for the user to add his/her application-specific information such that it becomes available as resources to the rest of the management system.

FIG. 13 is a block diagram of an agent 1300 according to one embodiment of the invention. The agent 1300 is, for example, suitable for use as the agent 1204 illustrated in FIG. 12.

The agent 1300 includes a master agent 1302 that couples to a plurality of sub-agents 1304. In particular, the agent 1300 utilizes N sub-agents 1304-1, 1304-2, . . . , 1304-n. Each of the sub-agents 1304-1, 1304-2, . . . , 1304-n respectively communicates with the managed products 1202-1, 1202-2, . . . , 1202-n shown in FIG. 12. The master agent 1302 thus interacts with the various managed products 1202 through the appropriate one of the sub-agents 1304. The master agent 1302 includes the resources that are shared by the sub-agents 1304. These shared resources are discussed in additional detail below with respect to FIG. 14. The master agent 1302 also provides an Application Programming Interface (API) that can be used by the user to write a sub-agent that can interact with a managed product for which a sub-agent is not provided by the management product. Using this API, the user-written sub-agent can make available the managed product specific information as resources to the rest of the management product including the master agent 1302 and the manager 108.

The agent 1300 also includes a communication module 1306. The communication module 1306 allows the agent 1300 to communicate with a management framework (and thus a manager) through a variety of different protocols. In other words, the communication module 1306 allows the agent 1300 to interface with other portions of a management system over different protocol layers. These communication protocols can be standardized, general purpose protocols (such as SNMP), or product-specific protocols (such as HPOV-SPI from Hewlett-Packard Company) or various other proprietary protocols. Hence, the communication module 1306 includes one or more protocol communication modules 1308. In particular, as illustrated in FIG. 13, the communication module 1306 can include protocol communication modules 1308-a, 1308-b, . . . , 1308-m. The protocol A communication module 1308-a interfaces to a communication network that utilizes protocol A. The protocol B communication module 1308-b interfaces with a communication network that utilizes protocol B. The protocol M communication module 1308-m interfaces with a communication network that utilizes protocol M.

FIG. 14 is a block diagram of a master agent 1400 according to one embodiment of the invention. The master agent 1400 is, for example, suitable for use as the master agent 1302 illustrated in FIG. 13.

The master agent 1400 includes a request processor 1402 that receives a request from the communication module 1306. The request is destined for one of the managed products 1202. Hence, the request processor 1402 operates to route an incoming request to the appropriate one of the sub-agents 1304 associated with the appropriate managed product 1202. Besides routing a request to the appropriate sub-agent 1304, the request processor 1402 can also perform additional operations, such as routing return responses from the sub-agents 1304 to the communication module 1306 (namely, the particular protocol communication module 1308 that is appropriate for use in returning the response to the balance of the management system, i.e., the manager).

The master agent 1400 typically includes a registry 1404 that stores registry data in a registry data store 1406. The registry 1404 manages lists which track the sub-agents 1304 that are available for use in processing requests for notification to the sub-agents 1304 or the protocol communication modules 1308. These lists that are maintained by the registry 1404 are stored as registry data in the registry data store 1406. Hence, the registry 1404 is the hub of the master agent 1400 for all traffic and interactions for other system components carried out at the agent 1300. The functionality provided by the registry 1404 includes (1) a mechanism for sub-agent registration, initialization, and dynamic configuration; (2) a communication framework for the sub-agent's interaction with the manager node through different communication modules present at the agent; (3) a notification mechanism for asynchronous notification delivery from the monitored systems and applications to the communication modules and the manager node; and (4) a sub-agent naming service so that sub-agents can be addressed by using simple, human-readable identifiers. The registry 1404 also acts as an interface between the communication modules 1308 so that the communication modules 1308 are able to configure registered sub-agents and receive asynchronous notifications from the registered sub-agents.
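As a rough sketch of the registry functions enumerated above (registration, routing by a human-readable name, and asynchronous notification delivery to the communication modules), consider the following Python example. The `Registry` class, its method names, and the sample handler are assumptions made for illustration and do not represent the actual interfaces of the registry 1404.

```python
class Registry:
    """Hub that names sub-agents and relays their notifications."""

    def __init__(self):
        self.sub_agents = {}  # human-readable name -> request handler
        self.listeners = []   # callbacks supplied by communication modules

    def register_sub_agent(self, name, handler):
        self.sub_agents[name] = handler

    def add_listener(self, callback):
        self.listeners.append(callback)

    def route(self, name, request):
        # Address the sub-agent by its registered name.
        return self.sub_agents[name](request)

    def notify(self, name, event):
        # Deliver a sub-agent notification to every communication module.
        for callback in self.listeners:
            callback(name, event)


registry = Registry()
registry.register_sub_agent("jvm", lambda request: f"jvm handled {request}")

received = []
registry.add_listener(lambda name, event: received.append((name, event)))

response = registry.route("jvm", "get jvm_heapused")
registry.notify("jvm", "crash")
```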

The master agent 1400 also includes a scheduler 1408 and statistical analyzer 1410. The scheduler 1408 can be utilized to schedule requests in the future to be processed by the request processor 1402. The statistical analyzer 1410 can be utilized to process (or at least pre-process) the response data being returned from the managed product 1202 before some or all data is returned to the manager. Hence, by having the master agent 1400 perform certain statistical analysis at the statistical analyzer 1410, the processing load on the manager can be distributed to the master agents.
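One way to picture the pre-processing performed at the agent is a reduction of raw samples to a compact summary before anything is returned to the manager. The Python sketch below assumes a hypothetical `summarize` function and made-up sample values. The particular statistics chosen are an assumption, since the text does not specify which analyses are performed.

```python
from statistics import mean


def summarize(samples):
    """Reduce raw samples to a small summary that is cheaper to send."""
    return {
        "count": len(samples),
        "min": min(samples),
        "max": max(samples),
        "mean": mean(samples),
    }


# Hypothetical raw heap readings collected at the managed node.
raw_heap_samples = [10, 20, 30]
summary = summarize(raw_heap_samples)
```

Only the summary, rather than every raw reading, would then need to travel to the manager.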

Each of the sub-agents 1304 can be a pluggable component enclosing monitoring and control functionality pertinent to a single system or application. The sub-agents 1304 are known to the managed products through the registry 1404. In other words, each of the sub-agents 1304 is registered and initialized by the registry 1404 before it can receive requests and send out information about the managed product it monitors. The principal task of the sub-agent 1304 is to interact with the managed product (e.g., system/application) it controls or monitors. The sub-agent 1304 serves to hide much interaction detail from the rest of the agent 1300 and provides only a few entry points for requests into the information.

The different protocols supported by the communication module 1306 allow the communication module 1306 to be dynamically extended to support additional protocols. As a particular protocol communication module 1308 is initialized, the registry 1404 within the master agent 1400 is informed of the particular protocol communication module 1308 so that asynchronous notifications from the managed objects can be received and passed to the manager via the particular protocol communication module 1308.

The communication module 1306 receives requests from a manager through the protocol implemented by the particular protocol communication module 1308, and forwards such requests to the appropriate sub-agent 1304 corresponding to the appropriate managed node. The registry 1404 within the master agent 1400 is utilized to forward the request from the protocol communication module 1308 to the sub-agents 1304.

In addition, the protocol communication module 1308 also provides a callback for the sub-agents 1304 such that notifications are able to be received from the managed product and sent back to the manager. If such callbacks are not provided, the notifications will be ignored by the sub-agents 1304 and, thus, no error will be reported to the manager. Hence, each of the protocol communication modules 1308 can be configured to handle or not handle notifications as desired by any particular implementation.

FIG. 15 is a block diagram of a sub-agent 1500 according to one embodiment of the invention. The sub-agent 1500 is, for example, suitable for use as any of the sub-agents 1304 illustrated in FIG. 13.

The sub-agent 1500 includes a get resource module 1502, a set operation module 1504, and an event forwarding module 1506. The get resource module 1502 interacts with a managed product to obtain resources being monitored by the managed product. The set operation module 1504 interacts with the managed product to set or control its operation. The event forwarding module 1506 operates to forward events that have occurred on the managed product to the manager. In addition, the sub-agent 1500 can further include a statistical analyzer 1508. The statistical analyzer 1508 can operate to perform statistical processing on raw data provided by a managed product at the sub-agent level. Hence, although the master agent 1400 may include the statistical analyzer 1410, the presence of statistical analyzer 1508 in each of the sub-agents 1500 allows further distribution of the processing load for statistical analysis of raw data.
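The three entry points of a sub-agent, together with the callback behavior described for notifications, can be sketched as follows. This Python example is illustrative only: the `SubAgent` class, its method names, and the dictionary standing in for a managed product are hypothetical. When no callback is supplied, an event is silently dropped, mirroring the case in which notifications are ignored and no error is reported.

```python
class SubAgent:
    """Get, set and event-forwarding entry points for one managed product."""

    def __init__(self, product, on_event=None):
        self.product = product    # dict standing in for the managed product
        self.on_event = on_event  # callback from a communication module, if any

    def get_resource(self, name):
        return self.product[name]

    def set_operation(self, name, value):
        self.product[name] = value

    def forward_event(self, event):
        if self.on_event is None:
            return False          # no callback: the event is ignored
        self.on_event(event)
        return True


forwarded = []
with_callback = SubAgent({"jvm_heapused": 3166032}, on_event=forwarded.append)
without_callback = SubAgent({"jvm_heapused": 0})

heap = with_callback.get_resource("jvm_heapused")
with_callback.set_operation("gc_enabled", True)
delivered = with_callback.forward_event("crash")
dropped = without_callback.forward_event("crash")
```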

FIGS. 16A and 16B are flow diagrams of manager startup processing 1600 according to one embodiment of the invention. The manager startup processing 1600 initially loads 1602 a knowledge base. The manager is, for example, the manager 200 illustrated in FIG. 2 and includes a knowledge base, such as the knowledge base 206 illustrated in FIG. 2. Once the knowledge base is loaded 1602, third-party management frameworks are discovered 1604. In one implementation, a management framework interface, such as the management framework interface 212 illustrated in FIG. 2, is utilized to identify and establish an interface to all available third-party management frameworks. Next, a list of node groups is obtained 1606. In one implementation, the list of node groups is retrieved by the management framework interface.

Next, a first node group is selected 1608 from the list of node groups. For the selected node group, a list of nodes within the selected node group is obtained 1610. A decision 1612 then determines whether there are more node groups to be processed. When the decision 1612 determines that there are more node groups to be processed, then the manager startup processing 1600 returns to repeat the operations 1608 and 1610 for a next node group. When the decision 1612 determines that there are no more node groups to be processed, all the nodes within each of the node groups have thus been obtained.

At this point, processing is performed on each of the nodes. A first node from the various nodes that have been obtained is selected 1614. Then, a list of domains within the selected node is obtained 1616. A decision 1618 then determines whether there are more nodes to be processed. When the decision 1618 determines that there are more nodes to be processed, then the manager startup processing 1600 returns to repeat the operations 1614 and 1616 for a next node.

On the other hand, when the decision 1618 determines that there are no more nodes to be processed, then processing can be performed for each of the domains. At this point, the manager startup processing 1600 performs processing on each of the domains that have been obtained. In this regard, a first domain is selected 1620. Then, a list of supported resources is obtained 1622 for the selected domain. A decision 1624 then determines whether all of the domains that have been identified have been processed. When the decision 1624 determines that there are additional domains to be processed, the manager startup processing 1600 returns to repeat the operations 1620 and 1622 for a next domain such that each domain can be similarly processed.

Next, processing is performed with respect to each of the nodes. At this point, a first node is selected 1626. Then, a customized knowledge base is produced 1628 for the selected node based on the supported resources for the selected node. In other words, the generalized knowledge base that is loaded 1602 is customized at operation 1628 such that a customized knowledge base is provided for each node that is active or present within the system being managed. A decision 1630 then determines whether there are more nodes to be processed. When the decision 1630 determines that there are more nodes to be processed, then the manager startup processing 1600 returns to repeat the operations 1626 and 1628 for a next node. Alternatively, when the decision 1630 determines that there are no more nodes to be processed, then data acquisition for those base rules within the customized knowledge bases can be scheduled 1632. Once the data acquisition has been scheduled 1632, the manager startup processing 1600 is complete and ends.
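The customization at operation 1628 can be illustrated, under simplifying assumptions, as filtering the generalized knowledge base down to the rules whose required resources a node actually supports. In the Python sketch below, the `customize` function, the `needs` field, and the rule, node and resource names are all hypothetical. The text does not specify the criteria actually used to customize a knowledge base.

```python
# Hypothetical generalized knowledge base: each rule lists the resources it needs.
KNOWLEDGE_BASE = [
    {"name": "memory-leak-detect", "needs": {"jvm_heapused"}},
    {"name": "connections-low", "needs": {"max_connections_configured"}},
]

# Hypothetical supported resources discovered for each node at startup.
SUPPORTED = {
    "node-a": {"jvm_heapused"},
    "node-b": {"jvm_heapused", "max_connections_configured"},
}


def customize(knowledge_base, supported_resources):
    """Keep only the rules whose required resources the node supports."""
    return [
        rule["name"]
        for rule in knowledge_base
        if rule["needs"] <= supported_resources  # subset test
    ]


custom = {
    node: customize(KNOWLEDGE_BASE, resources)
    for node, resources in SUPPORTED.items()
}
```

Here the node without the Oracle resource receives only the JVM rule, while the node supporting both resources receives both rules.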

FIGS. 16C-16E are flow diagrams of manager startup processing 1650 according to another embodiment of the invention. The manager startup processing 1650 initially loads 1652 a knowledge base with resources, rule packs and configuration information. The manager is, for example, the manager 200 illustrated in FIG. 2 and includes a knowledge base, such as the knowledge base 206 illustrated in FIG. 2. Once the knowledge base is loaded 1652, a list of node groups is obtained 1654.

A decision 1656 then determines whether there are any node groups to be processed. When the decision 1656 determines that there are node groups to be processed, then a first node group is selected 1658. Then, a list of nodes within the selected node group is obtained 1660.

Next, a decision 1662 determines whether there are any nodes in the selected node group that are to be processed. When the decision 1662 determines that there are nodes within the selected node group to be processed, then a first node is selected 1664. Then, for the selected node, a list of agent types on the selected node is obtained 1668.

A decision 1670 then determines whether there are any agent types to be processed. When the decision 1670 determines that there are agent types to be processed, a first agent type is selected 1671. Then, for the selected agent type, a decision 1672 determines whether there is any third party framework adapter. When the decision 1672 determines that there is no third party framework adapter, then a list of domains is obtained 1674. On the other hand, when the decision 1672 determines that there is a third party framework adapter, then a list of supported domains is discovered 1676. Here, the resulting list of supported domains includes information about product(s) supported by the third party adapter. The concept of domain in this case is adapter-specific. For example, for an SNMP adapter, all resources supported by the SNMP master agent on a managed node can be considered as belonging to a domain. Another concept of domain for an SNMP adapter can correspond to the resources supported by every SNMP sub-agent on the managed node communicating with the SNMP master agent.

Following the operations 1674 and 1676, a decision 1678 determines whether there are any domains within the selected agent type. When the decision 1678 determines that there are domains, then a first domain is selected 1680. Then, a list of supported resources and domain version are obtained 1682. Next, a decision 1684 determines whether there are more domains within the selected agent type. When the decision 1684 determines that there are more domains, then the manager startup processing 1650 returns to repeat the operation 1680 and subsequent operations so that a next domain can be similarly processed.

Alternatively, when the decision 1684 determines that there are no more domains within the selected agent type to be processed, as well as directly following the decision 1678 when there are no domains to be processed, a decision 1686 determines whether there are more agent types to be processed. When the decision 1686 determines that there are more agent types to be processed, then the manager startup processing 1650 returns to repeat the operation 1671 and subsequent operations so that a next agent type can be similarly processed.

On the other hand, when the decision 1686 determines that there are no more agent types to be processed, or directly following the decision 1670 when there are no agent types to be processed, a decision 1688 determines whether there are more nodes to be processed. When the decision 1688 determines that there are more nodes to be processed, then the manager startup processing 1650 returns to repeat the operation 1664 and subsequent operations so that a next node can be similarly processed.

Alternatively, when the decision 1688 determines that there are no more nodes to be processed, or directly following the decision 1662 when there are no nodes, a decision 1690 determines whether there are more node groups to be processed. When the decision 1690 determines that there are more node groups to be processed, the manager startup processing 1650 returns to repeat the operation 1658 and subsequent operations so that a next node group can be similarly processed.

On the other hand, when the decision 1690 determines that there are no more node groups to be processed, or directly following the decision 1656 when there are no node groups, a customized domain and resources list is produced 1692 based on available domains (and their versions) and resources information for rules input. Then, a customized knowledge base is produced 1694 for the selected nodes based on supported domains and resources.
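The nested traversal of node groups, nodes, agent types and domains described for the manager startup processing 1650 can be pictured with the following minimal sketch. The data layout, the AgentType class and the prefixing used to mimic adapter-specific domains are assumptions made purely for illustration; the actual discovery performed through a third party framework adapter is not specified here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AgentType:
    """Hypothetical description of one agent type found on a node."""
    name: str
    domains: List[str] = field(default_factory=list)
    third_party_adapter: bool = False


def domains_for(agent: AgentType) -> List[str]:
    """Return the domains of one agent type (cf. decision 1672)."""
    if agent.third_party_adapter:
        # cf. 1676: adapter-specific domains; the prefix is only a stand-in
        # for whatever discovery the adapter would really perform.
        return [f"{agent.name}:{d}" for d in agent.domains]
    return list(agent.domains)  # cf. 1674


def discover(groups: Dict[str, Dict[str, List[AgentType]]]) -> Dict[str, List[str]]:
    """Walk node groups -> nodes -> agent types, tolerating empty levels."""
    result: Dict[str, List[str]] = {}
    for nodes in groups.values():                 # cf. 1656-1658, 1690
        for node, agents in nodes.items():        # cf. 1662-1664, 1688
            # cf. 1670-1671, 1686: an empty agent list yields no domains.
            result[node] = [d for a in agents for d in domains_for(a)]
    return result
```

An empty node group or a node without agent types falls through each loop without special handling, which mirrors the "directly following the decision" branches of the flow.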

A reference resource list can be created using the most up-to-date version of each domain type. The reference resource list is used in rule definitions. For example, a JVM domain list of resources obtained from one managed node may be larger in number than the list of resources obtained for the JVM domain from a different managed node. This is possible because of enhancement of the agent 1204 over time. The reference resource list contains the maximal set of domains and resources from the latest version of all the knowledge domains by name/type. This enables a user to define rules for the most complete manageability of the user environment 100 (e.g., using one GUI).
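One possible way to assemble such a reference resource list is sketched below. It assumes, hypothetically, that each managed node reports a domain name, a version expressed as a tuple of integers, and the set of resources for that domain; the function and type names are illustrative only.

```python
from typing import Dict, FrozenSet, List, Tuple

Version = Tuple[int, ...]
# (domain name, domain version, resources) as reported by one managed node.
Report = Tuple[str, Version, FrozenSet[str]]


def reference_resource_list(reports: List[Report]) -> Dict[str, FrozenSet[str]]:
    """Keep, per domain name, the resources of the latest reported version."""
    latest: Dict[str, Tuple[Version, FrozenSet[str]]] = {}
    for domain, version, resources in reports:
        # Tuples compare element by element, so (1, 2) is newer than (1, 0).
        if domain not in latest or version > latest[domain][0]:
            latest[domain] = (version, resources)
    return {domain: resources for domain, (_, resources) in latest.items()}
```

For instance, if one node reports the jvm domain at version (1, 0) with a single resource and another reports version (1, 2) with two resources, the reference list for jvm is the two-resource set from version (1, 2).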

Next, a decision 1696 determines whether a knowledge processor has been selected to run. The decision 1696 enables a user to start the management system for development and testing of rules and also to set up all the managed nodes and select a set of rule packs and rules prior to running the knowledge processor. The decision 1696 can be facilitated by a GUI. When the decision 1696 determines that the knowledge processor is to be run, then data acquisition for those base rules within the customized knowledge base can be scheduled 1698.

Alternatively, when the decision 1696 determines that the knowledge processor is not selected to run, then the operation 1698 can be bypassed. Following the operation 1698, or its being bypassed, the manager startup processing 1650 is complete and ends.

FIG. 17A is a flow diagram of master agent startup processing 1700 according to one embodiment of the invention. A managed node includes an agent to assist the management system in monitoring and managing the managed node. In one embodiment, the agent includes a master agent and a plurality of sub-agents. Hence, the master agent startup processing 1700 pertains to startup processing that is performed by a master agent. The master agent is, for example, the master agent 1302 illustrated in FIG. 13.

The master agent startup processing 1700 initializes 1702 any pre-configured sub-agents for the master agent. Hence, any standard sub-agents for the master agent are initialized 1702. Then, the presence of any other sub-agents for the master agent is discovered 1704. These other sub-agents can be either in-process or out-of-process. An in-process sub-agent would operate in the same process as the master agent. On the other hand, an out-of-process sub-agent would operate in a separate process from that of the master agent. After any other sub-agents are discovered 1704, the discovered sub-agents are initialized 1706. A statistical analyzer can then be activated 1708 for each of the sub-agents. The statistical analyzers provide the statistics collection for the resources being monitored by the respective sub-agents. Following the operation 1708, the master agent startup processing 1700 is complete and ends.

FIG. 17B is a flow diagram of sub-agent startup processing 1750 according to one embodiment of the invention. The sub-agent startup processing 1750 is performed by a sub-agent. For example, the sub-agent can be one of the sub-agents 1304 illustrated in FIG. 13.

The sub-agent startup processing 1750 initially establishes 1752 a connection with the master agent. The connection is an interface or a communication link between the master agent and the sub-agent. Application resources are then discovered 1754. The application resources are those resources that are available from an application monitored by the sub-agent. The application resources can also include user-defined resources, e.g., using an API. Next, the master agent is notified 1756 of the status of the sub-agent. The status for the sub-agent can include various types of information. For example, the status of the sub-agent might include the resources that are available from the sub-agent, details about the version or operability of the sub-agent, etc. Next, a statistical analyzer can be activated 1758 for the sub-agent. The statistical analyzer allows the sub-agent to perform statistical analysis on resource information available from the sub-agent. Following the operation 1758, the sub-agent startup processing 1750 is complete and ends. It should, however, be recognized that the sub-agent startup processing 1750 is performed for each of the sub-agents associated with the master agent.
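The order of operations in the sub-agent startup processing 1750 can be illustrated with a short sketch. The classes, the status dictionary and the in-memory "connection" are hypothetical simplifications; a real agent would use an interface or communication link whose details are not given here.

```python
from typing import Dict, List, Optional


class MasterAgent:
    """Records the status reported by each sub-agent (hypothetical)."""

    def __init__(self) -> None:
        self.sub_agent_status: Dict[str, dict] = {}

    def notify(self, name: str, status: dict) -> None:
        self.sub_agent_status[name] = status


class SubAgent:
    """Minimal sub-agent following the order of operations 1752-1758."""

    def __init__(self, name: str, version: str, app_resources: List[str]) -> None:
        self.name = name
        self.version = version
        self.app_resources = app_resources
        self.master: Optional[MasterAgent] = None
        self.analyzer_active = False

    def startup(self, master: MasterAgent) -> None:
        self.master = master                    # establish connection (cf. 1752)
        resources = sorted(self.app_resources)  # discover resources (cf. 1754)
        master.notify(                          # report status (cf. 1756)
            self.name, {"resources": resources, "version": self.version}
        )
        self.analyzer_active = True             # activate analyzer (cf. 1758)
```

Calling startup once per sub-agent corresponds to the statement that the processing is performed for each of the sub-agents associated with the master agent.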

FIGS. 18A and 18B are flow diagrams of trigger/notification processing 1800 according to one embodiment of the invention. The trigger/notification processing 1800 is, for example, performed by a manager, such as the manager 108 illustrated in FIG. 1. In particular, the trigger/notification processing 1800 operates to trigger processing so that management information can be recorded and utilized, including initiation of notifications as appropriate.

The trigger/notification processing 1800 begins with a decision 1802 that determines whether a new fact has been asserted. When the decision 1802 determines that a new fact has not been asserted, then a decision 1804 determines whether a notification has been received. Here, the notifications could arrive from managed nodes. When the decision 1804 determines that a notification has not been received, then the trigger/notification processing 1800 returns to repeat the decision 1802. Once the decision 1802 determines that a new fact has been asserted or when the decision 1804 determines that a notification has been received, then a fact is asserted 1806 in the inference engine. The inference engine then processes the fact in the manager. For example, in the case of the manager 200 illustrated in FIG. 2, the inference engine is implemented by the knowledge processor 208. Next, a log entry is made 1808 into a log. The log entry indicates at least that the fact was asserted 1806.

Next, updated facts are retrieved 1810 for one or more rules that are dependent upon the asserted fact. Hence, the inference engine receives the asserted fact and determines which of the rules are dependent upon the asserted fact, and then for such rules, requests updated facts so that the rules can be fully and completely processed using up-to-date information.

Following the operation 1810, a decision 1812 determines whether the trigger/notification processing 1800 should stop. When the decision 1812 determines that the trigger/notification processing 1800 should stop, then those facts no longer needed are discarded 1813. Following the operation 1813, the trigger/notification processing 1800 is complete and ends. For example, a user might terminate the operation of the manager and thus end the trigger/notification processing 1800.

Alternatively, when the decision 1812 determines that the trigger/notification processing 1800 should not stop, then additional processing is performed depending upon the type of resource. For example, the resource or the rule being processed can signal for data acquisition, corrective action or debug operations. In particular, a decision 1814 determines whether data acquisition is requested. When the decision 1814 determines that data acquisition has been requested, then an updated fact is selected 1816. On the other hand, when the decision 1814 determines that data acquisition is not being requested, then a decision 1818 determines whether corrective action is indicated. For example, a rule within the knowledge base can request a corrective action be performed. In any case, when the decision 1818 determines that a corrective action has been requested, then the corrective action is performed 1820.

Alternatively, when the decision 1818 determines that a corrective action is not being requested, then a decision 1822 determines whether debug data is being requested. When the decision 1822 determines that debug data is requested, then debug data is obtained 1824.

Alternatively, when the decision 1822 determines that debug data is not being requested, then a decision 1828 determines whether a user-defined situation has occurred. When the decision 1828 determines that a user-defined situation has occurred, then an action 1830 is taken noting the occurrence of the user-defined situation.

Following any of the operations 1816, 1820, 1824, 1830 or the decision 1828 when a user-defined situation is not present, a log entry is made 1826 into the log. The log entry indicates the firing of the rule along with the specifics of the resources (including their values) on the left-hand side (or “if” part) of the rule. Following the logging operation 1826, the trigger/notification processing 1800 returns to repeat the operation 1806 and subsequent operations so that additional facts can be asserted and similarly processed.
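The cascade of decisions 1814, 1818, 1822 and 1828, followed by the logging operation 1826, amounts to a dispatch on the kind of request signaled by the rule. A minimal sketch follows; the request names, the returned strings and the list used as a log are assumptions for illustration only.

```python
from typing import Callable, Dict, List


def dispatch(request: str, log: List[str]) -> str:
    """Handle one request kind, then log the outcome (cf. 1814-1830, 1826)."""
    handlers: Dict[str, Callable[[], str]] = {
        "data_acquisition": lambda: "updated fact selected",         # cf. 1816
        "corrective_action": lambda: "corrective action performed",  # cf. 1820
        "debug": lambda: "debug data obtained",                      # cf. 1824
        "user_defined": lambda: "user-defined situation noted",      # cf. 1830
    }
    handler = handlers.get(request)
    # When no branch matches, processing still falls through to logging.
    outcome = handler() if handler is not None else "no action"
    log.append(f"{request}: {outcome}")                              # cf. 1826
    return outcome
```

In every branch, including the case where none of the decisions is satisfied, a log entry is written, which is what later allows a report to be generated from the log.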

Additionally, a user of the management system may interact with a Graphical User Interface (GUI) to request a report. The report provides information to the user about the management state of the one or more managed products within the enterprise or computer system being monitored.

FIG. 19 is a flow diagram of GUI report processing 1900 according to one embodiment of the invention. The GUI report processing 1900 is, for example, performed by a manager. For example, the manager can be the manager 200 illustrated in FIG. 2.

The GUI report processing 1900 can begin with a decision 1902 that determines whether a report has been requested. When the decision 1902 determines that a report has not yet been requested, the GUI report processing 1900 awaits such a request. In other words, the GUI report processing 1900 can be considered to be invoked once a report request has been received. In any case, when the decision 1902 determines that a report request has been received, then log data is retrieved 1904. For example, with respect to the manager 200 illustrated in FIG. 2, the log data can be retrieved 1904 from the log module 220. After the log data is retrieved 1904, a report is generated 1906 from the retrieved log data.

The report might indicate the various facts and rules that have been utilized by the management system over a period of time. For example, a report might specify those of the rules that were “fired” and, for each such rule, when it “fired,” why it “fired,” and the action (if any) taken. Additionally, a report might include details on the actions taken and related values. Still further, if one of the actions taken is a debug action, then the report might also include debug data. A report can also be targeted or selective in its content based on criteria. For example, a report can be limited with respect to one or more of a certain time range, an event, exceptions, domains and/or rule packs.
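A selective report of this kind can be sketched as a filter over log entries. The LogEntry fields, the numeric timestamps and the line format below are hypothetical; only the idea of limiting a report by time range and domain is taken from the description.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class LogEntry:
    """Hypothetical log record for one rule firing."""
    time: float
    rule: str
    domain: str
    reason: str
    action: Optional[str] = None


def generate_report(
    log: Iterable[LogEntry],
    start: Optional[float] = None,
    end: Optional[float] = None,
    domains: Optional[Set[str]] = None,
) -> List[str]:
    """Render the log entries that match the optional criteria."""
    lines: List[str] = []
    for entry in log:
        if start is not None and entry.time < start:
            continue
        if end is not None and entry.time > end:
            continue
        if domains is not None and entry.domain not in domains:
            continue
        lines.append(
            f"{entry.time:.0f} {entry.rule} fired: {entry.reason}; "
            f"action: {entry.action or 'none'}"
        )
    return lines
```

Leaving a criterion as None means the report is not limited in that respect, so the same function covers both full and targeted reports.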

Once the report has been generated 1906, a report delivery method is determined 1908. Here, the report delivery method can be pre-configured by an administrator of the management system to deliver reports to certain individuals or locations automatically. For example, the report can be delivered in the form of a notification that can be carried out using a pager, a voice mail, a voice synthesized telephone call, a facsimile, etc. Once the report delivery method has been determined 1908, the report is delivered 1910 using the determined report delivery method. It should be understood that the report delivery method can vary depending upon the nature of the report. For example, urgent reports can utilize one or more delivery methods that are more likely to reach the recipient immediately, such as a page or a mobile telephone call. Hence, the report can be delivered in a variety of different ways depending upon the application, circumstances and configuration of the management system. Following the delivery 1910 of the report, the GUI report processing 1900 is complete and ends.

FIGS. 20-29 are screen shots of a representative Graphical User Interface (GUI) suitable for use with one embodiment of the present invention. These screen shots detail how to create and maintain rules using the GUI.

How to Build a Rule Using Resources

To add (create) a rule, a user would access an Add New Rule page, such as shown in FIG. 20. Here, the user would perform the first of the four steps followed to add a new rule. Namely, the user would enter a name and description for the rule and select the rule pack to which it belongs. Upon pressing a Submit button, the process proceeds to the next step, where the user defines the situation, or left-hand side, of the rule, i.e., the conditions under which the rule will fire. In other words, the left-hand side is a list of situations and events (“When this happens . . . ”) that lead to the actions specified under the “Then define situation or do this . . . ” header, which is referred to as the right-hand side of the rule. Predicates of the left-hand side are called antecedents and elements of the right-hand side are called consequents.

As shown in FIG. 21, to build the left-hand side of a rule, first choose a knowledge domain from a Domains list on the left side of the screen. After a domain is selected from the list, the selection box below will show all resources of that domain. There are two kinds of domains, physical and special (or virtual). A physical domain represents a collection of resources pertaining to a software component or an entire software product, for instance the Java Virtual Machine, as opposed to a special, or virtual, domain. A special domain represents a set of resources which are not associated with any “physical” knowledge domain. Instead, such resources are used by the manager as building blocks to express conditions of the left-hand side or form actions on the right-hand side of a rule. In the representative rule being built, both a physical domain resource and a virtual domain resource are used. First, select the jvm domain from the list of domains and add two resources of that domain to the left-hand side of the rule (see FIG. 21).

Once we have selected all the resources used to define the situation, the “proceed to next step” button is selected. The next step is where relationships between the selected resources and/or their thresholds are set to configure the condition for the rule to fire. Now, add a condition to the left-hand side of the rule. This condition basically states that when the amount of heap memory currently in use is greater than a certain percentage of the maximum heap memory available, the rule should fire. In order to add a condition to the left-hand side of a rule, choose the Filter special domain. As shown in FIG. 22, one of the domain resources in the selection box will be Condition. The user just selects “Condition” and clicks the add button.

Next, an Edit Parameter button for the condition is selected and the desired condition expression entered. Here, the condition expression shown entered in FIG. 23 binds the two JVM resources. The condition is typically defined as an expression. A simple example of a condition expression is (a>b).

Let us look in detail at how we came up with the condition expression in FIG. 23. Please refer to FIG. 22 for better illustration. Under the “When this happens . . . ” header, note that there are three distinct entries, one below the other, as follows:

-   ?r1 jvm_HeapUsed
-   ?r2 jvm_MaxHeapSize
-   ?r3 Condition

Here ?r1, ?r2 and ?r3 are resource variable names assigned by the system to the resources jvm_HeapUsed, jvm_MaxHeapSize and Condition, respectively. This is to facilitate the definition of the condition expression using the resource variable names only. A simplified example of a condition expression using resource ?r1 is (?r1>1000000), which states that the rule is considered true (or, gets “fired”) in case jvm_HeapUsed exceeds 1000000 bytes or 1 MB. Note that, in this expression, ?r1 and 1000000 are operands and > is a comparison operator between the two operands.

In the condition expression ?r1>(?r2*0.60) in FIG. 23, the condition states that the rule is considered to be true if the JVM heap currently being used (jvm_HeapUsed), ?r1, is greater than 60% of (or 0.60 times) the maximum allowed heap size (jvm_MaxHeapSize), ?r2.
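Purely as an illustration of how such a condition evaluates, the expression can be written as a small predicate. The function name and the sample values are hypothetical; r1 and r2 stand in for the resource variables ?r1 (jvm_HeapUsed) and ?r2 (jvm_MaxHeapSize).

```python
def heap_condition(r1: float, r2: float, fraction: float = 0.60) -> bool:
    """True when heap in use (r1) exceeds the given fraction of max heap (r2)."""
    return r1 > r2 * fraction


# With a maximum heap of 1,000,000,000 bytes the threshold is 600,000,000 bytes.
max_heap = 1_000_000_000
above = heap_condition(700_000_000, max_heap)  # heap use above the threshold
below = heap_condition(500_000_000, max_heap)  # heap use below the threshold
```

Because the comparison is strict, heap usage exactly equal to 60% of the maximum does not satisfy the condition.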

Now that the left-hand side of the rule has been built, let us use the Configure Action(s) page shown in FIG. 24 to indicate what we want the system to do when the condition becomes true. Let us request that the system produce a report on the classes whose objects occupy most of the JVM heap, and also a report on where objects of the classes thus identified are allocated on the heap during the following 15 seconds.

Setting Up a Rule for Auto-Diagnostics

In order to test the rule that has been created (and also make sure that all components of the products are installed properly and communicate with each other), the manager should be set so that it considers the rule when the rule evaluation engine is started. Every rule can be configured in a flexible way. For instance, it can be set to be tested every 10 seconds, or every minute, or every hour. If a trial run of the rule is desired when the engine is run, a special option on the list of possible intervals, “once only,” can be chosen. The testing interval can be set on the same Rule Editing page as shown in FIG. 25.

Chaining of Rules

The rule shown in FIG. 24 is a rule that defines conditions for an abnormal situation. If the defined situation occurs, the system is requested to take one or more actions. In this representative example, the actions are the two requests for jvm_TopHeapObjects and jvm_AllocTrace on the right-hand side of the rule, under the “Then define situation or do this . . . ” header. This kind of rule is useful, but its capabilities are limited. If, instead of taking action right there in the rule, a situation is defined, then another rule can be built so that it gets triggered when this situation has been encountered. Through this mechanism, rules can be chained and hierarchies, or trees, of rules can be built.

For example, for this rule to be turned into a rule that can potentially be chained to other rules, a new situation has to be defined, see FIG. 26. The situation can then be added to the rule as a consequent, see FIG. 27.

Thereafter, as desired, another rule or a set of rules can be defined with JVMLowMemory as the antecedent and the system will automatically chain these rules, i.e., the set of rules defined with JVMLowMemory on the left-hand side of the rule will fire when the situation in FIG. 27 is declared in the modified rule in FIG. 24.
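The chaining behavior can be illustrated with a minimal forward-chaining sketch. It is not the rule engine of the described system: the Rule class, the run function and the rule named "NotifyOperator" are hypothetical, and only the situation name JVMLowMemory and the resource names are taken from the example above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set


@dataclass
class Rule:
    """A rule with a left-hand-side test and an optional declared situation."""
    name: str
    when: Callable[[Dict[str, float], Set[str]], bool]
    declares: Optional[str] = None  # situation added as a consequent


def run(rules: List[Rule], facts: Dict[str, float]) -> List[str]:
    """Fire rules until no new rule fires; return rule names in firing order."""
    situations: Set[str] = set()
    fired: List[str] = []
    progress = True
    while progress:
        progress = False
        for rule in rules:
            if rule.name in fired:
                continue  # each rule fires at most once in this sketch
            if rule.when(facts, situations):
                fired.append(rule.name)
                if rule.declares:
                    situations.add(rule.declares)
                progress = True
    return fired


rules = [
    # Chained rule: its antecedent is the declared situation.
    Rule("NotifyOperator", lambda f, s: "JVMLowMemory" in s),
    # Base rule: declares the situation instead of acting directly.
    Rule(
        "JVMHeap",
        lambda f, s: f["jvm_HeapUsed"] > f["jvm_MaxHeapSize"] * 0.60,
        declares="JVMLowMemory",
    ),
]
```

With jvm_HeapUsed at 700 and jvm_MaxHeapSize at 1000, the base rule fires first and declares JVMLowMemory, after which the chained rule fires on the next pass; with heap usage at 100, neither rule fires.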

Editing Rules

A previously defined (added) rule can be edited. To edit an existing rule, go to the Rule Management page, such as shown in FIG. 28, select an existing rule and click on the Edit button.

Starting and Stopping the Rule Engine

After a rule or a chain of rules has been created, the system is ready to monitor the software on the managed nodes. In order to initiate this process, from the Rule Management page, start the rule engine by clicking on the (Re)Start Engine button. If the rules engine has to be stopped, press the Stop Engine button in the Rule Management page. If any of the rules were edited or new rules were added and you want these changes to take effect, the (Re)Start Engine button in the Rule Management page has to be pressed. This will cause the engine to stop, automatically pick up any changes that have been made, and restart.

Note that every time the manager process is started, the Rule Engine status can be Ready. The current status of the engine is displayed in the top right-hand corner of the Rule Management page. For the rules to be fired according to the time and condition set in their definitions, the (Re)Start Engine button in the Rule Management page needs to be pressed explicitly. This changes the status of the Rule Engine from Ready to Running. You have to do this every time you add or make changes to rules and want the Rule Engine to pick up the additions/changes. As the engine gets into the running state, it checks resource values of the rules set up for periodic checking. In case all conditions on the left-hand side of such a rule become valid, the engine will proceed with the actions on the right-hand side of the rule, after which the rule will become blocked for as long as the conditions are valid. Then, the rule will be marked active again. All activities of the engine with respect to rule firing and subsequent actions are reflected on the Report page. The page can be accessed through the Report button on the Rule Management page, such as shown in FIG. 28.
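The blocking behavior described above, in which a rule fires once when its conditions become valid and is not fired again until the conditions have ceased to be valid, can be sketched as follows. The class name and the sample readings are hypothetical, and the sketch assumes the rule is marked active again as soon as a periodic check finds the conditions no longer valid.

```python
from typing import Callable, Dict


class PeriodicRule:
    """Fires once per episode in which its condition holds (a sketch)."""

    def __init__(self, condition: Callable[[Dict[str, float]], bool]) -> None:
        self.condition = condition
        self.blocked = False

    def check(self, values: Dict[str, float]) -> bool:
        """Return True only when the rule fires on this periodic check."""
        valid = self.condition(values)
        if valid and not self.blocked:
            self.blocked = True   # block while the conditions stay valid
            return True
        if not valid:
            self.blocked = False  # conditions cleared: rule is active again
        return False


rule = PeriodicRule(lambda v: v["jvm_HeapUsed"] > v["jvm_MaxHeapSize"] * 0.60)
readings = [700, 800, 500, 900]
firings = [rule.check({"jvm_HeapUsed": r, "jvm_MaxHeapSize": 1000}) for r in readings]
```

For the readings above, the rule fires on the first check, stays blocked on the second, is re-armed by the third, and fires again on the fourth, which avoids repeating the same actions on every interval while a problem persists.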

Report

The Report page for our example above, with heap usage reduced to 1% and allocation tracing time reduced to 5 seconds, is shown in FIG. 29. The Report page has several functional buttons which are self-descriptive: a Refresh button is used for updates of the page so it reflects the latest report information, a Clear button will render the report page empty, a Mail button will allow the report to be sent via e-mail and the Done button will take you back to the main page, the Rule Management page.

The sample report shown in FIG. 29 is a result of running the rule defined and shown in FIG. 24. The report reflects all important events associated with the system having run with the rule being activated for diagnostics. The first line of the report indicates that rule JVMHeap was fired and for what system the conditions of the rule became true and when it happened. Then values of the resources on the left-hand side of the rule, which led to the rule being triggered, are shown. Under the Actions taken header the resources of the right-hand side are shown. First, the list of the classes whose objects take up most of the space on the JVM heap is requested. Filters excluding all standard classes (java.*, javax.*) are applied so that only two classes appear on the list. This is because the application run by our JVM is truly simple. The second action is a 15 second allocation trace report for objects of the classes found on the top heap objects list. Under jvm_AllocTrace you can see all allocations of objects of the two classes. Each allocation trace shows where, in what method of what class, it took place. It also shows the line number in the source code for that class, if available (such would be available when the source code was compiled without disabling the debugging information generation).

The invention can be implemented in software, hardware, or a combination of hardware and software. The invention can also be embodied as computer readable code on a computer readable medium. The computer readable medium is any data storage device that can store data which can thereafter be read by a computer system. Examples of the computer readable medium include read-only memory, random-access memory, CD-ROMs, magnetic tape, and optical data storage devices. The computer readable medium can also be distributed over network-coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.

The many features and advantages of the present invention are apparent from the written description, and thus, it is intended by the appended claims to cover all such features and advantages of the invention. Further, since numerous modifications and changes will readily occur to those skilled in the art, it is not desired to limit the invention to the exact construction and operation as illustrated and described. Hence, all suitable modifications and equivalents may be resorted to as falling within the scope of the invention.

What is claimed is:
1. A non-transitory computer-readable medium including at least computer program code stored therein for managing an enterprise computer system, the enterprise computer system being configured to operate a plurality of different software products, said computer-readable medium comprising: computer program code for receiving a fact pertaining to a condition of at least one of the plurality of different software products that are operating in the enterprise computer system; computer program code for asserting the fact to an inference engine, the inference engine using rules based on facts, the rules are obtained from a knowledge base that stores the rules as well as resources associated with the plurality of different software programs; computer program code for retrieving at least one updated fact from the inference engine based on at least one rule from those of the rules stored in the knowledge base that are dependent on the fact that has been asserted; computer program code for initiating an action in view of the at least one updated fact; computer program code for diagnosing a software problem at the enterprise computer system due to at least one of the plurality of different software programs operating at the enterprise computer system, using the inference engine and the at least one rule from the knowledge base; and computer program code for making log entries to store log data in a log, wherein at least one of the log entries pertains to at least the fact that has been asserted, wherein at least one of the log entries pertains to the at least one updated fact, wherein at least one of the log entries pertains to the action being initiated, and wherein at least one of the log entries pertains to debug data.